+++ title = "Using R and R Markdown to Visualize the Coronavirus Pandemic" +++

Using R and R Markdown to Visualize the Coronavirus Pandemic

Danica Cortez

2020-04-28

For this project, I used the following packages:

  • glue
  • dplyr
  • tidyverse
  • lubridate
  • readr
  • viridis
  • plotly
  • scales
  • ggplotlyExtra
  • rjson
  • moments
  • quantreg
  • caret
  • kableExtra

I suggest running this code so that all your numbers are not in scientific notation:
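In base R this is a one-liner:

```r
# Penalize scientific notation heavily so large counts print in full
options(scipen = 999)
format(1000000)  # "1000000" instead of "1e+06"
```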

For this project, I used four data sets. The first has details about coronavirus cases in the USA by county (the equivalent of NUTS 3 regions), and comes from the New York Times' GitHub repository.

The second has socio-economic and demographic data about the US by county, taken from the Centers for Disease Control and Prevention (CDC).

The third data set has information about the number and characteristics of each airport by county, which I took from the Federal Aviation Administration.

Finally, the fourth data set was downloaded from Scott on Technology.

I cleaned as well as joined the second and third data sets in Excel into one.
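As a rough sketch of the loading step - the NYT file really does live in their public GitHub repository, but the local file name below is an assumption for illustration:

```r
# Read the NYT county-level data straight from GitHub; keep fips as text
# so leading zeros survive
nyt_url <- "https://raw.githubusercontent.com/nytimes/covid-19-data/master/us-counties.csv"
covid <- read.csv(nyt_url, colClasses = c(fips = "character"))

# The merged socio-economic/airport file prepared in Excel (assumed name):
# county_info <- read.csv("county_socioeconomic_airports.csv")

head(covid)
```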

Introduction to R Markdown

R Markdown is a great tool that lets you integrate many different parts of your analysis, such as interactive visualizations or server-side interactions.

These are just some of the things you can create with R Markdown:

  • PDFs
  • Word documents
  • HTML pages
  • dashboards
  • presentations
  • websites

Yes, that’s right - you can make presentations directly from R! This can be really helpful if you want to show code and output without having to copy and paste a bunch of things. Check out this blog for more information on R Markdown formats and tutorials.

R Markdown also works alongside WordPress, where it can be used to display Quick LaTeX. The easiest way to understand LaTeX is that it enables you to typeset maths easily. For example:

\[ \frac{x_b}{x_a} \] \[ \sum_{i=0}^{j}{\frac{z_i^j}{i+100^i}} \]

You can also render images but I’ll leave the explanation up to the experts. LaTeX rules are super easy to follow so don’t be afraid to try!

Preparing the Data

Before diving into any analysis, it is necessary to clean the data. While everyone has their own process, it is helpful for me when I already have an idea of what I want to do. I came up with three goals for this analysis:

  1. Perform an exploratory analysis
  2. Create interactive visualizations
  3. Play with a couple of models for the data

In the NYT data set, there were a couple of geographical exceptions for which they did not use traditional “fips” county codes - which are just unique identification numbers for counties. For example, they created a fictitious “New York City” that acts as a county in itself. I chose to impute these cases into a county that didn’t have any cases reported yet - Manhattan - in order to be able to visualize them using this geographical identifier (the fips code).
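A toy version of that imputation, assuming the NYT column layout (they leave fips empty for their “New York City” rows):

```r
# Toy slice of the NYT data: the NYC aggregate row has no fips code
covid <- data.frame(county = c("New York City", "Cook"),
                    state  = c("New York", "Illinois"),
                    fips   = c(NA, "17031"),
                    stringsAsFactors = FALSE)

# Impute Manhattan's code (New York County, 36061) so the rows can be mapped
covid$fips[covid$county == "New York City" & is.na(covid$fips)] <- "36061"
covid
```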

Exploratory Analysis

In order to perform an EDA, I created a couple of tables that summarize characteristics of the coronavirus data set.

First, we will look at information about the variables in the analysis. I merged the four data sets together, keeping cases from the 13th of April only.

Variable: Description

date: Date
county: Name of county
state: State of county
fips: Federal Information Processing Standards code - a five-digit code that uniquely identifies counties and county equivalents in the United States, certain U.S. possessions, and certain freely associated states
cases: Confirmed cases are patients who test positive for the coronavirus
deaths: The cumulative number of deaths as reported that day in that county or state
diabetes_perc: Diagnosed Diabetes, Age-Adjusted Percentage, 20+, 2016
nohsdip_perc: Percentage without High School Diploma, Ages 25+, 2013-2017 (5-year)
femalehd_perc: Families with Female Head of Household (%), 2013-2017 (5-year)
foodstmp_perc: Percentage Food Stamp/Supplemental Nutrition Assistance Program Recipients, 2016
med_home_val_000s: Median Home Value (in thousands of $), 2013-2017 (5-year)
med_hh_inc_000s: Median Household Income (in thousands of $), 2017
gini: Income Inequality (Gini Index), 2013-2017 (5-year)
perc_poverty: Percentage Living in Poverty, All Ages, 2017
unemp_rt: Unemployment Rate, Ages 16+, 2018
pop65up_perc: Population Aged 65 and Older (%), 2013-2017 (5-year)
airpm2.5conc: Annual Average Ambient Concentrations of PM2.5, 2015
perc_sev_housing: Percentage of Households Living with Severe Housing Problems, 2010-2014 (5-year)
urban_rural: Urban-Rural Status, 2014
hospitals: Number of Hospitals, 2017
prim_phys: Population per Primary Care Physician (in thousands), 2017
perc_wo_ins_und65: Percentage without Health Insurance, Under Age 65, 2017
cost_percap_medicare_hd: Cost of Care per Capita for Medicare Beneficiaries Diagnosed with Heart Disease, Total, 2017
hd_among_ins_perc: Prevalence of Diagnosed Heart Disease Among Medicare Beneficiaries, 2017
cardio_deathrt_per_hundthou_ov35: Total Cardiovascular Disease Death Rate per 100,000, 35+, All Races/Ethnicities, Both Genders, 2014-2017
hyperten_deathrt_perhundthou_over35: Hypertension Death Rate per 100,000 (any mention), 35+, All Races/Ethnicities, Both Genders, 2014-2017
population: Total Population, 2013-2017 (5-year)
airport: Number of attended airports in the county

Here, you can see the first and last cases included in this data set from the time that I downloaded it.

First Case Reported: 2020-01-21
Last Case Reported: 2020-04-13

Next, you can see some information about the coronavirus data set.

Number of Observations in the Data Set: 56541
Unique Counties (imputed Manhattan): 2681
Unique States: 55
Here, you can see the top 10 counties that have the most cumulative coronavirus deaths as of 13-04-2020.
date county state fips cases deaths
2020-04-13 New York City New York 36061 106764 7154
2020-04-13 Nassau New York 36059 24358 1109
2020-04-13 Wayne Michigan 26163 11648 760
2020-04-13 Westchester New York 36119 19785 610
2020-04-13 Suffolk New York 36103 21643 580
2020-04-13 Cook Illinois 17031 15474 543
2020-04-13 Bergen New Jersey 34003 10092 482
2020-04-13 Essex New Jersey 34013 7634 433
2020-04-13 Oakland Michigan 26125 5073 347
2020-04-13 Los Angeles California 6037 9420 320
Here, you can see the top 10 counties that have the most cumulative coronavirus cases as of 13-04-2020.
date county state fips cases deaths
2020-04-13 New York City New York 36061 106764 7154
2020-04-13 Nassau New York 36059 24358 1109
2020-04-13 Suffolk New York 36103 21643 580
2020-04-13 Westchester New York 36119 19785 610
2020-04-13 Cook Illinois 17031 15474 543
2020-04-13 Wayne Michigan 26163 11648 760
2020-04-13 Bergen New Jersey 34003 10092 482
2020-04-13 Los Angeles California 6037 9420 320
2020-04-13 Rockland New York 36087 7965 182
2020-04-13 Hudson New Jersey 34017 7879 236
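The ranking tables above boil down to a sort and a head; here is a base-R sketch on a made-up slice of the April 13th snapshot:

```r
# Toy slice of the April 13th snapshot
latest <- data.frame(
  county = c("New York City", "Nassau", "Wayne", "Cook"),
  state  = c("New York", "New York", "Michigan", "Illinois"),
  cases  = c(106764, 24358, 11648, 15474),
  deaths = c(7154, 1109, 760, 543)
)

top_deaths <- latest[order(-latest$deaths), ][1:3, ]  # most deaths first
top_cases  <- latest[order(-latest$cases),  ][1:3, ]  # most cases first
top_deaths
```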

Creating Interactive Data Visualizations

Based on both a lecture I had the other day on R, Python and Julia and some reading online, I have come to the conclusion that data visualizations in base R versus packages like ggplot2 suit two different situations.

Base R is recommended when you want to ensure the longevity of your code: because base R remains stable, you can be confident that your code will keep working on any computer for years to come.

One of the disadvantages of base R is, of course, that you won’t get the same level of creativity as you do with R packages.

On the flip side, packages are recommended when you’re not so much concerned about longevity as about the breadth of what you can accomplish: if you have a one-off project like this, or need to create a one-off report, R packages offer you a lot more creativity.

The disadvantage of using packages is that your code will be subject to package updates, or even to packages becoming obsolete in the future.

Creating Interactive Graphs

To make the following interactive graphs and maps, I am using two packages:

  • ggplot2
  • plotly

If you want to see some basic interactive charts, which were the inspiration for the following ones, check out R Graph Gallery’s section on interactive charts.

Here, you can see that the number of unique counties reporting at least 1 coronavirus case has steadily grown. However, at the latest date, this growth has finally slowed down.

In this graph, you can see the cumulative number of cases reported. As of April 13th, this number is still growing and the United States currently has the most COVID-19 cases in the world.
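As a minimal sketch of how charts like these are built (assuming ggplot2 and plotly are installed; the data frame here is made up):

```r
library(ggplot2)
library(plotly)

# Made-up cumulative counts standing in for the real by-day data
byday <- data.frame(date  = as.Date("2020-04-01") + 0:6,
                    cases = c(100, 180, 300, 470, 700, 1000, 1400))

p <- ggplot(byday, aes(x = date, y = cases)) +
  geom_line(color = "steelblue") +
  labs(title = "Cumulative cases (toy data)", x = NULL, y = "Cases")

ggplotly(p)  # wraps the static ggplot in an interactive widget
```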

Creating Interactive Maps

Creating maps in R can be done in both base R and by using packages. Choosing to use packages can add a level of interactivity that can help your users better understand and explore your data. For the following interactive graphs, I continued to use the plotly package. Some other packages you can use to create widgets and interactive maps include:

  • leaflet
  • tmap
  • shiny
  • mapview

If you want to create a basic interactive map, you can start with R Graph Gallery’s section on interactive maps.

If you want to see some of the crazy details you can add to your visualizations, check out Geocomputation with R’s chapter on making maps in R.

The basics of creating any map include two vital elements:

  1. Data with geospatial information: in my case, I am using two letter state codes and fips county codes
  2. A geospatial file such as GeoJSON: in my case, I downloaded it from plotly’s github

The geospatial file contains information that will help create your map, such as geometric info and info on latitudes and longitudes. It is important that your data include some sort of geospatial information, such as country or regional codes, because it acts as a bridge between your geospatial file and your data file.
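A hedged sketch of a county choropleth following plotly's own pattern - the GeoJSON URL is plotly's public example file mentioned above, while the two-row data frame is made up:

```r
library(plotly)
library(rjson)

# The geometry file: county polygons keyed by fips code
counties <- fromJSON(file = "https://raw.githubusercontent.com/plotly/datasets/master/geojson-counties-fips.json")

# Made-up data; `fips` is the bridge between the data and the geometry
latest <- data.frame(fips = c("36061", "17031"), cases = c(106764, 15474))

fig <- plot_ly(
  type       = "choropleth",
  geojson    = counties,
  locations  = latest$fips,
  z          = latest$cases,
  colorscale = "Viridis"
)
fig <- layout(fig, geo = list(scope = "usa"))
fig
```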

In the following map, you can see the reported cumulative cases for each state. As you can see in the legend, the brighter the color, the higher the amount of cumulative cases a state has reported.

Below, you can see two tables presenting the above information in tabular form. New York and New Jersey have the highest cases. You may have noticed that these states are white in the above graph - this is because their numbers were so high that not even converting the data into percentiles accurately depicted the range of cases in the US.

Meaning, because NY and NJ have such a high number of reported cases, it was very difficult to find an appropriate way to display them with the rest of the data. I’m still experimenting with my plotly code to combine two different maps into one, so I can present NY and NJ in a grey color with a different legend.

State Cumulative Cases Rank
New York 195031 1
New Jersey 64584 2
Massachusetts 26867 3
Michigan 25490 4
California 24334 5
Pennsylvania 24295 6
Illinois 22028 7
Louisiana 21016 8
Florida 21011 9
Texas 14488 10

Below, you can see the states with the lowest reported coronavirus cases as of 13 April.

State Cumulative Cases Rank
Northern Mariana Islands 11 1
Virgin Islands 51 2
Alaska 275 3
Wyoming 275 4
North Dakota 331 5
Montana 394 6
Hawaii 502 7
West Virginia 638 8
Maine 698 9
Guam 716 10

In this next map, you can see the amount of cases reported by counties. Like in the previous graph, the lighter the color, the higher amount of cases reported in the county. The main difference between this map and the previous one, besides the difference in geographical units, is that this map presents the range in percentiles.

This can often be a practical solution when you’re dealing with data sets with a wide range.

Modeling Coronavirus Cases

There are always multiple angles to take when analyzing a data set, especially one with so many variables. In the first part of this section, I will use two models to make predictions for the rest of the month’s coronavirus cases:

  • Logistic Growth Model
  • Log-Linear Model

In the second part, I try to see the relationship between socio-economic factors and cumulative cases as of the 13th of April.

Logistic Growth Model

Logistic growth models are used when a phenomenon grows rapidly in its early stages, but then slows at a later stage as it approaches some “carrying capacity.” I suggest checking out this page on Duke University’s website for a simple explanation and example of logistic growth models, as well as the video some of you may have already seen dealing specifically with the pandemic.

Following the example of the code for the logistic growth model by this blog, I took the following as my model.

\[ Cases_{day(t)} = \dfrac{asymptote}{1 + e^{((midpoint - t)*scale)}} \]

I used a self-starting function called SSlogis, together with a nonlinear (weighted) least-squares (NLS) model, fit with the nls function in R. Documentation and more information can be found on RDocumentation. My model estimated the asymptote, midpoint and scale of the logistic growth model and looks like this:

\[ Cumulative \; Cases_{day(t)} = \dfrac{\phi_{1}}{1 + e^{((\phi_{2} - t)*\phi_{3})}} \]

The self starting nls logistic model creates an initial estimate of the \(\phi\) parameters.
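A self-contained sketch of the fit - the data here are simulated from a logistic curve rather than the real case counts, and SSlogis supplies the starting values for phi1 (asymptote), phi2 (midpoint) and phi3 (scale):

```r
set.seed(1)
daysince <- 1:100
# Simulated cumulative cases from a logistic curve with mild noise
cumulative.cases <- 700000 / (1 + exp((76 - daysince) / 5)) * runif(100, 0.98, 1.02)
total.cases.byday <- data.frame(daysince, cumulative.cases)

# Self-starting logistic fit via nonlinear least squares
fit <- nls(cumulative.cases ~ SSlogis(daysince, phi1, phi2, phi3),
           data = total.cases.byday)
coef(fit)  # estimates of the asymptote, midpoint and scale
```

Predictions for later dates then come from predict(fit, newdata = data.frame(daysince = ...)).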

As we can see from the summary table, all parameters are statistically significant at the p < 0.001 level.

## 
## Formula: cumulative.cases ~ SSlogis(daysince, phi1, phi2, phi3)
## 
## Parameters:
##         Estimate  Std. Error t value            Pr(>|t|)    
## phi1 698495.7505   9421.9639   74.14 <0.0000000000000002 ***
## phi2     76.2966      0.1705  447.45 <0.0000000000000002 ***
## phi3      5.1101      0.0754   67.77 <0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4360 on 81 degrees of freedom
## 
## Number of iterations to convergence: 0 
## Achieved convergence tolerance: 0.000002029

Next, we want to plot our model along with the actual data points. I managed to graph my model thanks to the code on this blog.

I chose not to make this plot interactive, as the code for showing the equation clashed with plotly and I figured it would make my life easier to just show a static version. As you can see, the actual data points are in blue while the model fit is displayed by the red curve. The model fits the data almost perfectly.

Using this model, I predicted the cases up until the 13th of April. In other words, I tested my model by making an interpolation with all the dates available. Looking at the graph, the model performs very well for interpolation.

The next step, naturally, is to perform an extrapolation until April 21st. Because the NYT data set wasn’t updated to the 21st, I imputed values from Worldometer’s infamous COVID-19 live tracker. While it is helpful to keep in mind that the NYT and Worldometer data sets can vary, it is clear that the model underpredicts the actual cases.

Compared to the actual growth, the model seems to have predicted parameters that forecast a decrease in growth. In reality, these cases show no signs of slowing down.

Exponential Growth Model

In this part, I will try using an exponential growth model to predict coronavirus cases instead. This model is inspired by Towards Data Science’s page on modeling exponential growth of coronavirus cases.

An exponential growth model is pretty self-explanatory: it aims to model a variable with exponential growth. The equation of my model is as follows:

\[ x(t) = x_{0}*b^t \]

Where:

  • \(x(t)\) is the number of cases at time \(t\)
  • \(x_{0}\) is the initial number of cases
  • \(b\) is the growth factor, which is the number of people that are infected by each sick person

In order to coerce this formula into a linear expression, we simply take the logarithm of each variable so that the equation turns into:

\[ log \; x(t) = log(x_{0}) + log(b) * t \]
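A self-contained sketch of that regression on simulated exponential data - the “true” parameters below reuse the fitted values from the summary table, purely for illustration:

```r
set.seed(1)
daysince <- 1:84
# Simulate exponential growth with log(x0) = -0.975 and log(b) = 0.1715
cases <- exp(-0.975 + 0.1715 * daysince) * runif(84, 0.9, 1.1)
total.log.cases.byday <- data.frame(daysince, caseslog = log(cases))

# Fitting lm on the logs recovers log(x0) as the intercept, log(b) as the slope
fit.exp <- lm(caseslog ~ daysince, data = total.log.cases.byday)
coef(fit.exp)
exp(coef(fit.exp)[["daysince"]])  # implied growth factor b
```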

As you can see from the summary table, this linear model has a high \(R^2\), and both parameters are significant at the p < 0.001 level.

## 
## Call:
## lm(formula = caseslog ~ daysince, data = total.log.cases.byday)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.6377 -0.7885  0.1421  0.8513  1.5553 
## 
## Coefficients:
##              Estimate Std. Error t value             Pr(>|t|)    
## (Intercept) -0.975068   0.204840   -4.76           0.00000821 ***
## daysince     0.171532   0.004186   40.97 < 0.0000000000000002 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9303 on 82 degrees of freedom
## Multiple R-squared:  0.9534, Adjusted R-squared:  0.9529 
## F-statistic:  1679 on 1 and 82 DF,  p-value: < 0.00000000000000022

The next step is to use this model to predict the number of cases up to the 20th of April and see if it performs better than the logistic growth model.

Plotting the projection and the real data points together, we can see that this model overpredicts the number of cases by a lot. As a result, neither of these models fits quite right.

Needless to say, forecasting is already difficult, but doing so from the epidemiology standpoint is even more so - at least in my opinion. At this point, I would probably go back to the SSlogis model and play around with the parameters, making hypotheses about what they will be.

If you’re interested in looking at some projections that have been done for your country, check out Health Data’s projections until August.

Socio-Economic Model

What is interesting about any model is trying to tease out whether or not any other factors play a role in having a high number of coronavirus cases or deaths. While coronavirus cases are still growing right now, and therefore require models specific to pandemics, there will come a time when they stop growing.

In this socio-economic model, I’m curious to know if there are any shared characteristics between counties with low and high coronavirus cases. For this part, I will be treating the cumulative cases on the 13th of April as the final amount of coronavirus cases.

According to the World Health Organization, older people and people who have pre-existing medical conditions - asthma, diabetes, heart disease - are more at risk of contracting COVID-19.

Health isn’t the only factor involved in the spread of coronavirus. According to the L.A. Times, “black people continue to see the highest COVID-19 death rate in L.A. County.” Looking at the death rates per race as of the 26th of April, we can see a significant difference: 13 deaths per 100,000 for black people, 9.5 for Latinos, 7.5 for Asian people and 5.5 for white people.

Of course, being a certain race doesn’t make you more or less susceptible to the novel virus. Certain socio-economic factors can be the underlying cause of these disparate death rates, such as: having insurance, household income, number of airports in your county, access to hospitals, etc.

There are many different ways you can play with this data. You can employ both parametric and non-parametric tests:

  • Cluster analysis
  • Factor analysis
  • Linear regression
  • Quantile regression

Here, I’ve chosen to perform only the last two, for simplicity. I won’t go into significant detail; this is more a demonstration of the types of analysis you can do.

Simple Linear Regression

First, I check the spread of the variables, as one of the assumptions for linear regression is normality.
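A quick way to see what the skewness tables below measure - a hand-rolled skewness function (a simple stand-in for moments::skewness) applied to a right-skewed toy variable:

```r
# Sample skewness: third central moment over cubed standard deviation
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

set.seed(42)
x <- rexp(1000)  # right-skewed, like cases or population
skew(x)          # strongly positive
skew(log(x))     # much less skewed after a log transform
```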

Correlation Coefficient with Cumulative Cases Variable Name
0.3164380 population
0.2814870 hospitals
0.2760621 med_home_val_000s
-0.1832606 urban_rural
0.1620762 airport
0.1475656 perc_sev_housing
0.1416452 med_hh_inc_000s
0.1346236 gini
-0.0646698 diabetes_perc
-0.0619975 cardio_deathrt_per_hundthou_ov35
-0.0578892 pop65up_perc
-0.0549134 prim_phys
-0.0479434 perc_wo_ins_und65
0.0432933 airpm2.5conc
-0.0316094 perc_poverty
0.0307291 femalehd_perc
-0.0287720 hyperten_deathrt_perhundthou_over35
0.0267720 cost_percap_medicare_hd
-0.0243113 nohsdip_perc
-0.0169363 foodstmp_perc
-0.0144535 unemp_rt
0.0026920 hd_among_ins_perc
Skewness
cases 38.0928215
population 13.5106668
hospitals 12.7503638
airport 5.7501017
prim_phys 5.0997760
med_home_val_000s 3.4016067
unemp_rt 1.9853510
hyperten_deathrt_perhundthou_over35 1.4120476
med_hh_inc_000s 1.3876523
perc_sev_housing 1.3632143
femalehd_perc 1.2155457
perc_poverty 1.0037907
foodstmp_perc 0.9494621
pop65up_perc 0.9247539
nohsdip_perc 0.9005759
perc_wo_ins_und65 0.7816133
cardio_deathrt_per_hundthou_ov35 0.7057460
cost_percap_medicare_hd 0.6428656
airpm2.5conc -0.4747358
gini 0.3984492
diabetes_perc 0.3452524
hd_among_ins_perc 0.0635327

You can go ahead and explore the correlations. Because there are a couple of variables that are highly skewed, I suggest we do a logarithmic transformation on them. The new correlation table is below.

Correlation Coefficient with Cumulative Cases Log of Variable
-0.6000246 urban_rural
0.5435196 airport
0.5258378 population
0.4903658 hospitals
0.4462501 med_home_val_000s
0.4161428 med_hh_inc_000s
0.3887019 perc_sev_housing
-0.3862815 pop65up_perc
0.2716570 airpm2.5conc
0.2533475 femalehd_perc
-0.2281880 prim_phys
-0.1713834 cardio_deathrt_per_hundthou_ov35
0.1651199 gini
-0.1546424 perc_poverty
-0.1540688 nohsdip_perc
-0.1282357 perc_wo_ins_und65
-0.0572528 hyperten_deathrt_perhundthou_over35
-0.0451831 diabetes_perc
-0.0316169 foodstmp_perc
-0.0268021 hd_among_ins_perc
-0.0267021 unemp_rt
0.0232935 cost_percap_medicare_hd

Now, you can see that all the variables are either moderately skewed or approximately symmetric. While I am worried about multicollinearity, this is something that I’ll check afterwards.

Skewness
foodstmp_perc 0.9494621
pop65up_perc 0.9247539
nohsdip_perc 0.9005759
med_home_val_000s 0.7990006
perc_wo_ins_und65 0.7816133
cases 0.7094941
cardio_deathrt_per_hundthou_ov35 0.7057460
prim_phys 0.6748102
cost_percap_medicare_hd 0.6428656
population 0.5398575
airpm2.5conc -0.4747358
gini 0.3984492
med_hh_inc_000s 0.3965121
diabetes_perc 0.3452524
hyperten_deathrt_perhundthou_over35 -0.2749007
perc_poverty -0.2230960
perc_sev_housing -0.1926100
femalehd_perc -0.1646569
unemp_rt 0.0869987
hd_among_ins_perc 0.0635327
hospitals NaN
airport NaN
## 
## Call:
## lm(formula = cases ~ ., data = linfit1data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.1270 -0.6139 -0.0215  0.5718  3.4639 
## 
## Coefficients:
##                                         Estimate   Std. Error t value
## (Intercept)                         -20.24369815   1.51148432 -13.393
## population                            0.85059592   0.03604451  23.598
## hospitals                             0.12687856   0.05620829   2.257
## airport                               0.02510218   0.03780417   0.664
## prim_phys                            -0.02437483   0.04126718  -0.591
## med_home_val_000s                     0.22044494   0.10788088   2.043
## unemp_rt                             -0.16970540   0.08270335  -2.052
## hyperten_deathrt_perhundthou_over35  -0.25602421   0.05248973  -4.878
## med_hh_inc_000s                       1.95054027   0.31004955   6.291
## perc_sev_housing                      0.23464258   0.11911015   1.970
## femalehd_perc                         1.03526165   0.11319939   9.145
## perc_poverty                          0.38124217   0.16866368   2.260
## foodstmp_perc                        -0.00300026   0.00648071  -0.463
## pop65up_perc                          0.01046893   0.00635761   1.647
## nohsdip_perc                         -0.00566500   0.00568505  -0.996
## perc_wo_ins_und65                     0.00011427   0.00577856   0.020
## cardio_deathrt_per_hundthou_ov35      0.00099043   0.00033915   2.920
## cost_percap_medicare_hd               0.00001291   0.00001015   1.273
## gini                                  5.02035562   0.78481070   6.397
## diabetes_perc                         0.03010349   0.01517702   1.983
## hd_among_ins_perc                    -0.00600691   0.00564833  -1.063
## airpm2.5conc                          0.00705227   0.01482253   0.476
## urban_rural2                         -0.06928934   0.15730728  -0.440
## urban_rural3                         -0.30601285   0.15132883  -2.022
## urban_rural4                         -0.49783790   0.16160910  -3.081
##                                                 Pr(>|t|)    
## (Intercept)                         < 0.0000000000000002 ***
## population                          < 0.0000000000000002 ***
## hospitals                                        0.02408 *  
## airport                                          0.50675    
## prim_phys                                        0.55481    
## med_home_val_000s                                0.04112 *  
## unemp_rt                                         0.04028 *  
## hyperten_deathrt_perhundthou_over35       0.000001144641 ***
## med_hh_inc_000s                           0.000000000374 ***
## perc_sev_housing                                 0.04896 *  
## femalehd_perc                       < 0.0000000000000002 ***
## perc_poverty                                     0.02389 *  
## foodstmp_perc                                    0.64344    
## pop65up_perc                                     0.09976 .  
## nohsdip_perc                                     0.31912    
## perc_wo_ins_und65                                0.98423    
## cardio_deathrt_per_hundthou_ov35                 0.00353 ** 
## cost_percap_medicare_hd                          0.20331    
## gini                                      0.000000000190 ***
## diabetes_perc                                    0.04743 *  
## hd_among_ins_perc                                0.28767    
## airpm2.5conc                                     0.63427    
## urban_rural2                                     0.65964    
## urban_rural3                                     0.04327 *  
## urban_rural4                                     0.00209 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9616 on 2383 degrees of freedom
##   (254 observations deleted due to missingness)
## Multiple R-squared:  0.7489, Adjusted R-squared:  0.7463 
## F-statistic: 296.1 on 24 and 2383 DF,  p-value: < 0.00000000000000022

Here we see that the model has a great \(R^2\) value, although some of the variables are not statistically significant. Because we might have a problem with multicollinearity, we have to check out the VIF, or variance inflation factor.

##         GVIF Df GVIF^(1/(2*Df))                                vars
## 1  14.677516  1        3.831125                     med_hh_inc_000s
## 2  11.145020  1        3.338416                        perc_poverty
## 3   5.904324  1        2.429881                   med_home_val_000s
## 4   5.589864  1        2.364289                          population
## 5   5.302795  1        2.302780                       foodstmp_perc
## 6   3.795620  1        1.948235                       femalehd_perc
## 7   3.156536  1        1.776664                        nohsdip_perc
## 8   3.153826  3        1.210987                         urban_rural
## 9   3.067385  1        1.751395                       diabetes_perc
## 10  3.032982  1        1.741546    cardio_deathrt_per_hundthou_ov35
## 11  2.834033  1        1.683459                    perc_sev_housing
## 12  2.744162  1        1.656551                           hospitals
## 13  2.383997  1        1.544020                   hd_among_ins_perc
## 14  2.196092  1        1.481922                             airport
## 15  1.946701  1        1.395242                        airpm2.5conc
## 16  1.911237  1        1.382475                   perc_wo_ins_und65
## 17  1.873603  1        1.368796                                gini
## 18  1.861858  1        1.364499                            unemp_rt
## 19  1.854786  1        1.361905                        pop65up_perc
## 20  1.679301  1        1.295878                           prim_phys
## 21  1.489281  1        1.220361             cost_percap_medicare_hd
## 22  1.250564  1        1.118286 hyperten_deathrt_perhundthou_over35

Looking at the VIFs, we can see that some variables, like population or hospitals, have high VIFs. This makes sense, as they all have high correlation coefficients with one or more variables in the data set. You can check out the correlation matrix below.

There are many ways to deal with multicollinearity, however I choose to simply take some of these highly correlated variables out and run a second regression.
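The VIF itself is simple to compute by hand - regress each predictor on the others and take 1 / (1 - R²). A sketch on built-in data (mtcars, not the county data):

```r
vif_by_hand <- function(df, var) {
  others <- setdiff(names(df), var)
  r2 <- summary(lm(reformulate(others, response = var), data = df))$r.squared
  1 / (1 - r2)  # large when `var` is well explained by the other predictors
}

preds <- mtcars[, c("disp", "hp", "wt")]
sapply(names(preds), function(v) vif_by_hand(preds, v))
```

In practice, the car package's vif() does this for you (and handles factors via the generalized GVIF shown above).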

cases med_hh_inc_000s pop65up_perc hospitals prim_phys perc_wo_ins_und65 cost_percap_medicare_hd hd_among_ins_perc airport
cases 1.0000000 0.1416452 -0.0578892 0.2814870 -0.0549134 -0.0479434 0.0267720 0.0026920 0.1620762
med_hh_inc_000s 0.1416452 1.0000000 -0.2823681 0.1968010 -0.2001174 -0.3854992 -0.2680835 -0.3598292 0.2847773
pop65up_perc -0.0578892 -0.2823681 1.0000000 -0.1860082 0.0248048 -0.0279625 -0.1437373 -0.0548488 -0.1907273
hospitals 0.2814870 0.1968010 -0.1860082 1.0000000 -0.1975375 -0.0647762 0.0971480 -0.0448295 0.7733669
prim_phys -0.0549134 -0.2001174 0.0248048 -0.1975375 1.0000000 0.2015667 0.1390482 0.2031435 -0.2019914
perc_wo_ins_und65 -0.0479434 -0.3854992 -0.0279625 -0.0647762 0.2015667 1.0000000 0.2725076 0.2446784 -0.0909493
cost_percap_medicare_hd 0.0267720 -0.2680835 -0.1437373 0.0971480 0.1390482 0.2725076 1.0000000 0.3918527 0.0232562
hd_among_ins_perc 0.0026920 -0.3598292 -0.0548488 -0.0448295 0.2031435 0.2446784 0.3918527 1.0000000 -0.0547139
airport 0.1620762 0.2847773 -0.1907273 0.7733669 -0.2019914 -0.0909493 0.0232562 -0.0547139 1.0000000

Here, we can see a table with the new correlation matrix with the variables I chose to keep.

## 
## Call:
## lm(formula = log(cases) ~ ., data = linfit2data)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -12.9285  -0.9419   0.0232   0.9310   6.4516 
## 
## Coefficients:
##                             Estimate   Std. Error t value             Pr(>|t|)
## (Intercept)              0.846370562  0.424399829   1.994               0.0462
## med_hh_inc_000s          0.038925510  0.002757431  14.117 < 0.0000000000000002
## pop65up_perc            -0.096738731  0.007521974 -12.861 < 0.0000000000000002
## hospitals                0.093817992  0.016516824   5.680       0.000000015075
## prim_phys               -0.077948842  0.012504279  -6.234       0.000000000536
## perc_wo_ins_und65       -0.002650449  0.006907690  -0.384               0.7012
## cost_percap_medicare_hd  0.000002291  0.000014103   0.162               0.8709
## hd_among_ins_perc        0.035774401  0.006208671   5.762       0.000000009371
## airport                  0.159235175  0.012889401  12.354 < 0.0000000000000002
##                            
## (Intercept)             *  
## med_hh_inc_000s         ***
## pop65up_perc            ***
## hospitals               ***
## prim_phys               ***
## perc_wo_ins_und65          
## cost_percap_medicare_hd    
## hd_among_ins_perc       ***
## airport                 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.425 on 2412 degrees of freedom
##   (241 observations deleted due to missingness)
## Multiple R-squared:  0.4444, Adjusted R-squared:  0.4425 
## F-statistic: 241.1 on 8 and 2412 DF,  p-value: < 0.00000000000000022

This second linear regression shows an okay \(R^2\), significantly lower than our previous model, and some variables that are not statistically significant.

Quantile Regression

The best way of explaining quantile regression is to quote the experts. Authors Cook and Manning, in their paper entitled “Thinking beyond the mean” for the Shanghai Archives of Psychiatry, wrote:

"Quantile regression is not a regression estimated on a quantile, or subsample of data as the name may suggest….

In OLS regression, the goal is to minimize the distances between the values predicted by the regression line and the observed values. In contrast, quantile regression differentially weights the distances between the values predicted by the regression line and the observed values, then tries to minimize the weighted distances. For a 75th percentile regression, this can be viewed as weighting the distances between the values predicted by the regression line and the observed values below the line (negative residuals) by 0.25, and weighting the distances between the values predicted by the regression line and the observed values above the line (positive residuals) by 0.75. Doing so ensures that minimization occurs when 75 percent of the residuals are negative."

Because quantile regression estimates parameters at different quantiles of the outcome, the assumption of normality is not required. Because of this, I think it will be interesting to see whether socio-economic and demographic effects vary across quantiles of cumulative cases. In other words, do counties with the highest coronavirus case counts behave differently than those with the lowest for the same variables?
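This kind of comparison can be fitted with the `quantreg` package. A minimal sketch, assuming the joined county data frame is called `covid_counties` and using a shortened version of the model for readability:

```r
library(quantreg)

# Fit the same specification at several quantiles of cumulative cases
# (covid_counties and this shortened formula are illustrative assumptions)
taus  <- c(0.1, 0.25, 0.5, 0.75, 0.9)
fit_q <- rq(cases ~ med_hh_inc_000s + femalehd_perc + pop65up_perc + airport,
            tau = taus, data = covid_counties)

# plot() on the summary draws one panel per coefficient, showing how the
# estimate and its confidence band change across tau
plot(summary(fit_q))
```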

[Plots of the quantile regression coefficients for each variable across quantiles.]

If you’ve never seen this plot before, don’t be scared - it’s actually quite simple. Some variables show significant differences between the lower and upper percentiles, like female head of household percentage for example. For each percentage-point increase in the female head of household variable at the 0.1 quantile, there are about 0.7 more cases; at the 0.9 quantile this number is about 2.

Check out some of the other variables and see what interesting relationships you can find. You can also perform an ANOVA test to check whether the coefficients are statistically different at different quantiles. Because it is a joint test of equality of slopes, i.e. a test of equality on all the coefficients at once, you can think of it as a “slope homogeneity test.”

\[ H_0: \text{the slopes are homogeneous} \] \[ H_1: \text{not } H_0 \]
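In `quantreg`, this joint test is available through `anova.rq`. A minimal sketch, again assuming a data frame called `covid_counties` and a shortened formula for illustration:

```r
library(quantreg)

# Fit the same model at the 0.1 and 0.9 quantiles (names are assumptions)
fit_q10 <- rq(cases ~ femalehd_perc + pop65up_perc + airport,
              tau = 0.1, data = covid_counties)
fit_q90 <- rq(cases ~ femalehd_perc + pop65up_perc + airport,
              tau = 0.9, data = covid_counties)

# With joint = TRUE (the default), anova tests H0: all slopes are equal
# across the two quantiles, producing an analysis-of-deviance table
anova(fit_q10, fit_q90, joint = TRUE)
```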

## Quantile Regression Analysis of Deviance Table
## 
## Model: cases ~ population + hospitals + airport + prim_phys + med_home_val_000s + unemp_rt + hyperten_deathrt_perhundthou_over35 + med_hh_inc_000s + perc_sev_housing + femalehd_perc + perc_poverty + foodstmp_perc + pop65up_perc + nohsdip_perc + perc_wo_ins_und65 + cardio_deathrt_per_hundthou_ov35 + cost_percap_medicare_hd + gini + diabetes_perc + hd_among_ins_perc + airpm2.5conc + urban_rural
## Joint Test of Equality of Slopes: tau in {  0.1 0.9  }
## 
##   Df Resid Df F value                Pr(>F)    
## 1 24     4792  6.3954 < 0.00000000000000022 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

With a p-value that is essentially zero (p < 0.00000000000000022), we reject the null hypothesis in favor of the alternative. This supports the idea that counties in the higher deciles of coronavirus cases have different parameters for socio-economic and demographic variables than those in the lower deciles.

Conclusion

Analyses are like points of view - you may see a data set in an entirely different light than even your closest friend. My analysis is by no means complete - if you see something really important missing, go ahead and download the data for your region and build on it from your own unique perspective!

In conclusion, the things I went over were:

  • Exploratory analysis in R
  • Interactive graphs and maps in R
  • Modeling exponential growth

Please keep in mind that you should always go through your variables thoroughly, using both descriptive statistics and visualizations, before doing any type of analysis. I held back here because I wanted to show what was possible without going into too much detail - if you’re like me, you could probably deal with the same data set forever and still find other ways to examine/present your findings.

Some ways you can add to or improve this type of analysis are:

  • Explore death rates for coronavirus
  • Create an interactive graph for the infamous “curve” everyone speaks about flattening
  • Talk about other socio-economic and demographic aspects related to the coronavirus, such as this one
  • Perform a more complete logistic-growth, linear, or quantile regression analysis

Here are some more references to get you going with visualizations in R, along with examples of the pages I used.

And of course all the other helpful ones like Stack Overflow and the R Graph Gallery.

If you have any questions, feel free to let me know and thank you for your time!